We have analyzed the generalization performance of a student which slowly switches ensemble teachers. By calculating the generalization error analytically using statistical mechanics in the framework of on-line learning, we show that the dynamical behaviors of the generalization error have a periodicity that is synchronized with the switching period and that the behaviors differ with the number of ensemble teachers. Furthermore, we show that the smaller the switching period is, the larger the difference is.
Learning can be classified into batch learning and on-line learning. In on-line learning, examples once used are discarded and a student cannot give correct answers for all examples used in training. However, there are merits; for example, a large memory for storing many examples is not necessary and it is possible to follow a time-variant teacher. Recently, we used a statistical mechanical method to analyze the generalization performance of a model composed of linear perceptrons: a true teacher, ensemble teachers, and the student, in the framework of on-line learning. That is, we treated a model that has $K$ teachers, called ensemble teachers, who exist around a true teacher. In that study, we analyzed the model in which a student switches the ensemble teachers in turn or randomly at each time step. Therefore, the study was an analysis of a fast switching model. In contrast, the properties of a model in which a student switches the ensemble teachers slowly are also of interest. In this letter, we analyze such a slow switching model.

We consider a true teacher, $K$ ensemble teachers, and a student. They are all linear perceptrons with connection weights $\boldsymbol{A}$, $\boldsymbol{B}_k$, and $\boldsymbol{J}$, respectively. Here, $k = 1, \ldots, K$. For simplicity, the connection weights of the true teacher, the ensemble teachers, and the student are simply called the true teacher, the ensemble teachers, and the student, respectively. The true teacher $\boldsymbol{A} = (A_1, \ldots, A_N)$, ensemble teachers $\boldsymbol{B}_k = (B_{k1}, \ldots, B_{kN})$, student $\boldsymbol{J} = (J_1, \ldots, J_N)$, and input $\boldsymbol{x} = (x_1, \ldots, x_N)$ are $N$-dimensional vectors. Each component $A_i$ of $\boldsymbol{A}$ is drawn from $\mathcal{N}(0, 1)$ independently and fixed, where $\mathcal{N}(0, 1)$ denotes the Gaussian distribution with a mean of zero and a variance of unity. Some components $B_{ki}$ are equal to $A_i$ multiplied by $-1$; the others are equal to $A_i$. Which components $B_{ki}$ are equal to $-A_i$ is independent of the value of $A_i$. Hence, $B_{ki}$ also obeys $\mathcal{N}(0, 1)$ and it is also fixed. The direction cosine between $\boldsymbol{B}_k$ and $\boldsymbol{A}$ is $R_{B_k}$ and that between $\boldsymbol{B}_k$ and $\boldsymbol{B}_{k'}$ is $q_{kk'}$. Each of the components $J_i^0$ of the initial value $\boldsymbol{J}^0$ of $\boldsymbol{J}$ is drawn from $\mathcal{N}(0, 1)$ independently. The direction cosine between $\boldsymbol{J}$ and $\boldsymbol{A}$ is $R_J$ and that between $\boldsymbol{J}$ and $\boldsymbol{B}_k$ is $R_{B_kJ}$. Each component $x_i$ of $\boldsymbol{x}$ is drawn from $\mathcal{N}(0, 1/N)$ independently.
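The sign-flip construction of the ensemble teachers can be illustrated numerically. The following is a minimal sketch, assuming that each component is flipped independently with probability $p = (1 - R_B)/2$ and that the flip patterns of different teachers are independent; under these assumptions one expects $R_{B_k} \approx 1 - 2p$ and $q_{kk'} \approx (1 - 2p)^2$. The function names, the value of $N$, and the random seed are illustrative choices and do not come from the letter.

```python
import numpy as np

def make_teachers(N, K, R_B, rng):
    """Draw a true teacher A and K ensemble teachers B_k (rows of B).

    Assumption: every component is sign-flipped independently with
    probability p = (1 - R_B) / 2, independently of the value of A_i.
    """
    A = rng.standard_normal(N)
    p_flip = (1.0 - R_B) / 2.0
    signs = np.where(rng.random((K, N)) < p_flip, -1.0, 1.0)
    B = signs * A  # B_ki = +A_i or -A_i
    return A, B

def direction_cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
A, B = make_teachers(N=20000, K=3, R_B=0.7, rng=rng)
R_est = [direction_cosine(A, B[k]) for k in range(3)]  # each close to 0.7
q_est = direction_cosine(B[0], B[1])                   # close to 0.7**2
print(R_est, q_est)
```

With independent flip patterns, the choice $R_B = 0.7$ gives $q \approx 0.49$; other values of $q$ would require correlated flip patterns, which this sketch does not cover.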

This letter assumes the thermodynamic limit $N \to \infty$. Therefore, $\|\boldsymbol{A}\| = \|\boldsymbol{B}_k\| = \|\boldsymbol{J}^0\| = \sqrt{N}$ and $\|\boldsymbol{x}\| = 1$. Generally, the norm $\|\boldsymbol{J}\|$ of the student changes as the time step proceeds. Therefore, the ratio $l^m$ of the norm to $\sqrt{N}$ is introduced and called the length of the student. That is, $\|\boldsymbol{J}^m\| = l^m \sqrt{N}$, where $m$ denotes the time step. The outputs of the true teacher, the ensemble teachers, and the student are $y^m + n_A^m$, $v_k^m + n_{B_k}^m$, and $u^m l^m + n_J^m$, respectively. Here, $y^m = \boldsymbol{A} \cdot \boldsymbol{x}^m$, $v_k^m = \boldsymbol{B}_k \cdot \boldsymbol{x}^m$, and $u^m l^m = \boldsymbol{J}^m \cdot \boldsymbol{x}^m$, where $y^m$, $v_k^m$, and $u^m$ obey Gaussian distributions with a mean of zero and a variance of unity. $n_A^m$, $n_{B_k}^m$, and $n_J^m$ are independent Gaussian noises with variances of $\sigma_A^2$, $\sigma_{B_k}^2$, and $\sigma_J^2$, respectively.

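We define the error $\epsilon_{B_k}$ between the true teacher $\boldsymbol{A}$ and each member $\boldsymbol{B}_k$ of the ensemble teachers by the squared error of their outputs: $\epsilon_{B_k}^m = \frac{1}{2}\left(y^m + n_A^m - v_k^m - n_{B_k}^m\right)^2$. In the same manner, we define the error $\epsilon_{B_kJ}$ between each member $\boldsymbol{B}_k$ of the ensemble teachers and the student $\boldsymbol{J}$ by the squared error of their outputs: $\epsilon_{B_kJ}^m = \frac{1}{2}\left(v_k^m + n_{B_k}^m - u^m l^m - n_J^m\right)^2$. The student $\boldsymbol{J}$ adopts the gradient method as a learning rule and uses the input $\boldsymbol{x}$ and an output of one of the $K$ ensemble teachers $\boldsymbol{B}_k$. Here, the student $\boldsymbol{J}$ uses each ensemble teacher $\boldsymbol{B}_k$ $TN$ times successively, where $T$ is $O(1)$. That is,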

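$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^m - \eta \frac{\partial \epsilon_{B_kJ}^m}{\partial \boldsymbol{J}^m} \tag{1}$$
$$\phantom{\boldsymbol{J}^{m+1}} = \boldsymbol{J}^m + \eta \left(v_k^m + n_{B_k}^m - u^m l^m - n_J^m\right) \boldsymbol{x}^m, \tag{2}$$
$$k = \mathrm{mod}\left(\left[\frac{m}{TN}\right], K\right) + 1, \tag{3}$$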

where $\eta$ denotes the learning rate and is a constant. The Gauss notation is denoted by $[\cdot]$. That is, $[x]$ is the maximum integer which is not larger than $x$. Here, $\mathrm{mod}\left(\left[\frac{m}{TN}\right], K\right)$ denotes the remainder of $\left[\frac{m}{TN}\right]$ divided by $K$. Equation (3) means that the student uses each ensemble teacher $TN \sim O(N)$ times successively. We call this slow switching. By generalizing the learning rules, Eq. (2) can be expressed as $\boldsymbol{J}^{m+1} = \boldsymbol{J}^m + f_k^m \boldsymbol{x}^m$, where $f$ denotes a function that represents the update amount and is determined by the learning rule. In addition, we define the error $\epsilon_J$ between the true teacher $\boldsymbol{A}$ and the student $\boldsymbol{J}$ by the squared error of their outputs: $\epsilon_J^m = \frac{1}{2}\left(y^m + n_A^m - u^m l^m - n_J^m\right)^2$.
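The learning rule of Eq. (2) and the switching schedule of Eq. (3) can be sketched as follows. This is a minimal illustration rather than the simulation code used for the figures; the function names and the convention of passing noise standard deviations (`sigma_B`, `sigma_J`) instead of variances are assumptions made here.

```python
import math
import numpy as np

def teacher_index(m, T, N, K):
    """Eq. (3): 1-based index k of the teacher used at time step m.

    Each teacher is used for T*N consecutive steps (slow switching).
    """
    return math.floor(m / (T * N)) % K + 1

def update_student(J, x, B_k, eta, sigma_B, sigma_J, rng):
    """One on-line step of Eq. (2) using teacher B_k and input x.

    sigma_B and sigma_J are noise standard deviations (assumed convention).
    """
    teacher_out = B_k @ x + sigma_B * rng.standard_normal()  # v_k + n_Bk
    student_out = J @ x + sigma_J * rng.standard_normal()    # u*l + n_J
    f = eta * (teacher_out - student_out)                    # update amount f_k
    return J + f * x

# Schedule for K = 5 teachers, T = 1.0, N = 10: one index per switching period.
print([teacher_index(m, 1.0, 10, 5) for m in range(0, 60, 10)])
```

Setting $T = 1/N$ in `teacher_index` would switch teachers at every step, which corresponds to switching in turn as in the fast switching model.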

One of the goals of statistical learning theory is to theoretically obtain generalization errors. Since the generalization error is the mean of the error for the true teacher over the distribution of new inputs and noises, the generalization error $\epsilon_{B_kg}$ of each member $\boldsymbol{B}_k$ of the ensemble teachers and $\epsilon_{Jg}$ of the student $\boldsymbol{J}$ are calculated as follows. Superscripts $m$, which represent the time step, are omitted for simplicity unless stated otherwise.

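$$\epsilon_{B_kg} = \int d\boldsymbol{x}\, dn_A\, dn_{B_k}\, P(\boldsymbol{x}, n_A, n_{B_k})\, \epsilon_{B_k} \tag{4}$$
$$\phantom{\epsilon_{B_kg}} = \int dy\, dv_k\, dn_A\, dn_{B_k}\, P(y, v_k, n_A, n_{B_k})\, \frac{1}{2}\left(y + n_A - v_k - n_{B_k}\right)^2 \tag{5}$$
$$\phantom{\epsilon_{B_kg}} = \frac{1}{2}\left(-2R_{B_k} + 2 + \sigma_A^2 + \sigma_{B_k}^2\right), \tag{6}$$
$$\epsilon_{Jg} = \int d\boldsymbol{x}\, dn_A\, dn_J\, P(\boldsymbol{x}, n_A, n_J)\, \epsilon_J \tag{7}$$
$$\phantom{\epsilon_{Jg}} = \int dy\, du\, dn_A\, dn_J\, P(y, u, n_A, n_J)\, \frac{1}{2}\left(y + n_A - ul - n_J\right)^2 \tag{8}$$
$$\phantom{\epsilon_{Jg}} = \frac{1}{2}\left(-2lR_J + l^2 + 1 + \sigma_A^2 + \sigma_J^2\right). \tag{9}$$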
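The closed forms of Eqs. (6) and (9) can be checked against a direct Monte Carlo average of Eq. (8). The sketch below assumes that $y$ and $u$ are jointly Gaussian with unit variances and correlation $R_J$, as implied by the definitions above; the function names and sample size are illustrative.

```python
import numpy as np

def eps_Bg(R_B, sigma_A2, sigma_B2):
    """Eq. (6): generalization error of an ensemble teacher."""
    return 0.5 * (-2.0 * R_B + 2.0 + sigma_A2 + sigma_B2)

def eps_Jg(l, R_J, sigma_A2, sigma_J2):
    """Eq. (9): generalization error of the student."""
    return 0.5 * (-2.0 * l * R_J + l**2 + 1.0 + sigma_A2 + sigma_J2)

def eps_Jg_monte_carlo(l, R_J, sigma_A2, sigma_J2, n_samples, rng):
    """Sample average of Eq. (8) with corr(y, u) = R_J (assumed joint Gaussian)."""
    y = rng.standard_normal(n_samples)
    u = R_J * y + np.sqrt(1.0 - R_J**2) * rng.standard_normal(n_samples)
    n_A = np.sqrt(sigma_A2) * rng.standard_normal(n_samples)
    n_J = np.sqrt(sigma_J2) * rng.standard_normal(n_samples)
    return float(np.mean(0.5 * (y + n_A - u * l - n_J) ** 2))

rng = np.random.default_rng(0)
exact = eps_Jg(0.8, 0.5, 0.1, 0.3)
sampled = eps_Jg_monte_carlo(0.8, 0.5, 0.1, 0.3, 200_000, rng)
print(exact, sampled)
```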

To simplify the analysis, two auxiliary order parameters $r_J \equiv lR_J$ and $r_{B_kJ} \equiv lR_{B_kJ}$ are introduced. Simultaneous differential equations in deterministic forms, which describe the dynamical behaviors of the order parameters when the student uses a teacher $\boldsymbol{B}_{k'}$ that is one of the ensemble teachers, have been obtained on the basis of self-averaging in the thermodynamic limit as follows:

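$$\frac{dr_{B_kJ}}{dt} = \langle f_{k'} v_k \rangle, \quad \frac{dr_J}{dt} = \langle f_{k'} y \rangle, \quad \frac{dl}{dt} = \langle f_{k'} u \rangle + \frac{1}{2l}\langle f_{k'}^2 \rangle. \tag{10}$$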

Here, dimension N has been treated to be sufficiently greater than the number K of ensemble 
teachers. Time is defined by t = m/N, that is, time step m normalized by dimension N. Since 
linear perceptrons are treated in this letter, the sample averages that appeared in the above 
equations can be easily calculated as follows: 

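$$\langle f_{k'} u \rangle = \eta\left(\frac{r_{B_{k'}J}}{l} - l\right), \quad \langle f_{k'}^2 \rangle = \eta^2\left(l^2 - 2r_{B_{k'}J} + 1 + \sigma_{B_{k'}}^2 + \sigma_J^2\right), \tag{11}$$
$$\langle f_{k'} y \rangle = \eta\left(R_{B_{k'}} - r_J\right), \quad \langle f_{k'} v_k \rangle = \eta\left(q_{k'k} - r_{B_kJ}\right). \tag{12}$$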
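Equations (10)-(12) can also be integrated numerically while a single teacher is in use. The following sketch uses a plain forward Euler scheme, which is a choice made here for brevity and not a method taken from the letter; the state layout, function names, and 0-based teacher index `k_used` are likewise assumptions.

```python
import numpy as np

def rhs(state, k_used, eta, R_B, Q, sigma_B2, sigma_J2):
    """Right-hand sides of Eq. (10), using the averages of Eqs. (11)-(12).

    state = (r_BJ, r_J, l), where r_BJ is a length-K array of r_{B_k J}.
    k_used is the 0-based index k' of the teacher currently in use.
    Q is the K x K matrix of direction cosines q_{kk'} (ones on the diagonal).
    """
    r_BJ, r_J, l = state
    f_u = eta * (r_BJ[k_used] / l - l)
    f_sq = eta**2 * (l**2 - 2.0 * r_BJ[k_used] + 1.0
                     + sigma_B2[k_used] + sigma_J2)
    d_r_BJ = eta * (Q[k_used] - r_BJ)
    d_r_J = eta * (R_B[k_used] - r_J)
    d_l = f_u + f_sq / (2.0 * l)
    return d_r_BJ, d_r_J, d_l

def euler_integrate(state, k_used, duration, n_steps, **params):
    """Forward Euler integration over `duration` with one fixed teacher."""
    r_BJ, r_J, l = state
    r_BJ = np.array(r_BJ, dtype=float)
    dt = duration / n_steps
    for _ in range(n_steps):
        d_r_BJ, d_r_J, d_l = rhs((r_BJ, r_J, l), k_used, **params)
        r_BJ = r_BJ + dt * d_r_BJ
        r_J = r_J + dt * d_r_J
        l = l + dt * d_l
    return r_BJ, r_J, l
```

For a slow switching run, `euler_integrate` would be called once per switching period of length $T$, passing the end state of one period as the initial state of the next and advancing `k_used` cyclically.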

Let us denote the values of $r_J$, $r_{B_kJ}$, and $l^2$ at $t = t_0$ as $r_J^{t_0}$, $r_{B_kJ}^{t_0}$, and $(l^{t_0})^2$, respectively. By using these as initial values, the simultaneous differential equations Eqs. (10)-(12) can be solved analytically as follows:

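$$r_{B_kJ} = q_{k'k} + \left(r_{B_kJ}^{t_0} - q_{k'k}\right) e^{-\eta(t - t_0)}, \tag{13}$$
$$r_J = R_{B_{k'}} + \left(r_J^{t_0} - R_{B_{k'}}\right) e^{-\eta(t - t_0)}, \tag{14}$$
$$l^2 = 1 + \frac{\eta}{2 - \eta}\left(\sigma_{B_{k'}}^2 + \sigma_J^2\right) + 2\left(r_{B_{k'}J}^{t_0} - 1\right) e^{-\eta(t - t_0)}$$
$$\qquad + \left((l^{t_0})^2 - 1 - \frac{\eta}{2 - \eta}\left(\sigma_{B_{k'}}^2 + \sigma_J^2\right) - 2\left(r_{B_{k'}J}^{t_0} - 1\right)\right) e^{\eta(\eta - 2)(t - t_0)}. \tag{15}$$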
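The analytical solution of Eqs. (13)-(15) for one interval during which a fixed teacher is used can be written as a short function. This is a sketch under the assumptions that $q_{k'k'} = 1$ and $\eta \neq 2$; the function name, argument order, and 0-based index `k_used` are illustrative.

```python
import numpy as np

def solve_interval(r_BJ0, r_J0, l2_0, k_used, tau,
                   eta, R_B, Q, sigma_B2, sigma_J2):
    """Eqs. (13)-(15) at elapsed time tau = t - t0 with teacher k' = k_used.

    r_BJ0, r_J0, l2_0 are the values of r_{B_k J}, r_J, and l^2 at t = t0.
    Q is the K x K matrix of q_{kk'}; its diagonal is assumed to be 1.
    """
    r_BJ0 = np.asarray(r_BJ0, dtype=float)
    decay = np.exp(-eta * tau)
    r_BJ = Q[k_used] + (r_BJ0 - Q[k_used]) * decay                 # Eq. (13)
    r_J = R_B[k_used] + (r_J0 - R_B[k_used]) * decay               # Eq. (14)
    noise = eta / (2.0 - eta) * (sigma_B2[k_used] + sigma_J2)
    a = 2.0 * (r_BJ0[k_used] - 1.0)
    l2 = (1.0 + noise + a * decay
          + (l2_0 - 1.0 - noise - a) * np.exp(eta * (eta - 2.0) * tau))  # Eq. (15)
    return r_BJ, r_J, l2
```

Because the equations are autonomous while the teacher is fixed, evolving for $\tau_1$ and then for $\tau_2$ from the resulting state gives the same values as evolving for $\tau_1 + \tau_2$ directly, which is a convenient consistency check. For $0 < \eta < 2$ and large $\tau$, $l^2$ approaches $1 + \frac{\eta}{2-\eta}(\sigma_{B_{k'}}^2 + \sigma_J^2)$.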

Since all components $A_i$ and $J_i^0$ of the true teacher $\boldsymbol{A}$ and the initial student $\boldsymbol{J}^0$ are drawn from $\mathcal{N}(0, 1)$ independently, and because the thermodynamic limit $N \to \infty$ is also assumed, they are orthogonal to each other at $t = 0$. That is, $R_J = 0$ and $l = 1$ when $t = 0$.

In the following, we consider the case where the direction cosines $R_{B_k}$ between the ensemble teachers and the true teacher, the direction cosines $q_{kk'}$ among the ensemble teachers, and the variances $\sigma_{B_k}^2$ of the noises of the ensemble teachers are uniform. That is,

$$R_{B_k} = R_B \ (k = 1, \ldots, K), \quad q_{kk'} = \begin{cases} 1, & (k = k'), \\ q, & (\text{otherwise}), \end{cases} \quad \sigma_{B_k}^2 = \sigma_B^2. \tag{16}$$

The dynamical behaviors of the generalization error $\epsilon_{Jg}$ have been analytically obtained by substituting Eqs. (14) and (15) into Eq. (9). The analytical results and the corresponding simulation results, where N = 10^, are shown in Figs. 1 and 2. In the computer simulations, $\epsilon_{Jg}$ was obtained by averaging the squared errors for 5 x 10^ random inputs at each time step. In these figures, the curves represent theoretical results and the symbols represent simulation results. $R_B = 0.7$ and $q = 0.49$ are common conditions for both figures. In addition, $\eta = 0.3$, $\sigma_A^2 = 0.1$, $\sigma_B^2 = 0.2$, and $\sigma_J^2 = 0.3$ are the conditions for Fig. 1, and $\eta = 1.5$, $\sigma_A^2 = 0.01$, $\sigma_B^2 = 0.02$, and $\sigma_J^2 = 0.03$ are the conditions for Fig. 2.
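The theoretical curves can be generated by chaining Eqs. (13)-(15) over successive switching periods and inserting the result into Eq. (9). The sketch below assumes the uniform conditions of Eq. (16) and, in addition to $R_J = 0$ and $l = 1$, takes $r_{B_kJ} = 0$ at $t = 0$ (the initial student is assumed orthogonal to every ensemble teacher as well). The function name and the sampling density `pts_per_period` are illustrative.

```python
import numpy as np

def theory_curve(K, T, eta, R_B, q, sigma_A2, sigma_B2, sigma_J2,
                 t_max, pts_per_period=50):
    """Times t and generalization error eps_Jg(t) under slow switching."""
    Q = np.full((K, K), q) + (1.0 - q) * np.eye(K)  # q_kk' of Eq. (16)
    noise = eta / (2.0 - eta) * (sigma_B2 + sigma_J2)
    # Assumed initial conditions at t = 0.
    r_BJ, r_J, l2 = np.zeros(K), 0.0, 1.0
    tau = np.linspace(0.0, T, pts_per_period, endpoint=False)
    times, errors = [], []
    for period in range(int(np.ceil(t_max / T))):
        k = period % K                      # 0-based teacher index, cf. Eq. (3)
        a = 2.0 * (r_BJ[k] - 1.0)
        # Order parameters inside the period, Eqs. (14) and (15).
        r_J_t = R_B + (r_J - R_B) * np.exp(-eta * tau)
        l2_t = (1.0 + noise + a * np.exp(-eta * tau)
                + (l2 - 1.0 - noise - a) * np.exp(eta * (eta - 2.0) * tau))
        times.append(period * T + tau)
        errors.append(0.5 * (-2.0 * r_J_t + l2_t + 1.0 + sigma_A2 + sigma_J2))
        # Values at the end of the period become the next initial values.
        end = np.exp(-eta * T)
        l2 = (1.0 + noise + a * end
              + (l2 - 1.0 - noise - a) * np.exp(eta * (eta - 2.0) * T))
        r_J = R_B + (r_J - R_B) * end
        r_BJ = Q[k] + (r_BJ - Q[k]) * end   # Eq. (13) for every k
    return np.concatenate(times), np.concatenate(errors)

# Conditions of Fig. 1(a): eta = 0.3, T = 5.0.
t2, e2 = theory_curve(2, 5.0, 0.3, 0.7, 0.49, 0.1, 0.2, 0.3, t_max=30.0)
t5, e5 = theory_curve(5, 5.0, 0.3, 0.7, 0.49, 0.1, 0.2, 0.3, t_max=30.0)
```

In this sketch the curves for $K = 2$ and $K = 5$ coincide for $t < 2T$, since both students have used only $\boldsymbol{B}_1$ and $\boldsymbol{B}_2$ up to that time, and they separate afterwards.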
Fig. 1. Dynamical behaviors of generalization errors $\epsilon_{Jg}$ when $\eta = 0.3$. Theory and computer simulations. $R_B = 0.7$, $q = 0.49$, $\sigma_A^2 = 0.1$, $\sigma_B^2 = 0.2$, and $\sigma_J^2 = 0.3$. (a) $T = 5.0$, (b) $T = 2.0$.



These figures show that the dynamical behaviors of the generalization error have a periodicity that is synchronized with the switching period $T$. In the case of $K = 2$, the student uses the ensemble teachers in the order $\boldsymbol{B}_1 \to \boldsymbol{B}_2 \to \boldsymbol{B}_1 \to \boldsymbol{B}_2 \to \cdots$. In the case of $K = 5$, the order is $\boldsymbol{B}_1 \to \boldsymbol{B}_2 \to \boldsymbol{B}_3 \to \boldsymbol{B}_4 \to \boldsymbol{B}_5 \to \boldsymbol{B}_1 \to \boldsymbol{B}_2 \to \boldsymbol{B}_3 \to \cdots$. Therefore, the generalization errors $\epsilon_{Jg}$ for $K = 2$ and $K = 5$ agree completely during the first two switching periods from the initial state, because the teachers used by the student are the same. In contrast, the generalization errors $\epsilon_{Jg}$ for $K = 2$ and $K = 5$ are not the same after the second period. In our study on the fast switching model, it was proven that when a student's learning rate satisfies $\eta < 1$, the larger the number $K$ is, the smaller the student's generalization error is. The same phenomenon is observed in the slow switching model treated in this letter; that is, the generalization error for $K = 5$ is smaller than that for $K = 2$, as shown in Fig. 1. In contrast, the generalization error for $K = 5$ is larger than that for $K = 2$ in Fig. 2. Here, the dynamical behavior approaches that of the fast switching model asymptotically in the limit of the switching period $T \to 0$.
Fig. 2. Dynamical behaviors of generalization errors $\epsilon_{Jg}$ when $\eta = 1.5$. Theory and computer simulations. $R_B = 0.7$, $q = 0.49$, $\sigma_A^2 = 0.01$, $\sigma_B^2 = 0.02$, and $\sigma_J^2 = 0.03$. (a) $T = 1.0$, (b) $T = 0.5$.



In both cases of $\eta = 0.3$ and $1.5$, the smaller the switching period $T$ is, the larger the difference between the generalization error $\epsilon_{Jg}$ for $K = 2$ and that for $K = 5$ is. The reason is as follows: if the switching period $T$ is large, the student learns enough from only the one teacher that it uses during the period. In other words, as the student forgets the other teachers, the influence of the number $K$ of ensemble teachers becomes small.




Fig. 3. Student's projection to the 2-D plane on which $\boldsymbol{B}_1$-$\boldsymbol{B}_3$ exist. (a) $\eta = 0.3$, (b) $\eta = 1.5$. Solid lines represent trajectories of the student's projection obtained theoretically. Symbols △ and ▽ represent computer simulations with (a) $T = 2.0$ and $T = 5.0$, (b) $T = 0.5$ and $T = 1.0$, respectively.



We visualize the student's behaviors in the case of $K = 3$ to understand them intuitively. That is, we obtain the student's projection to the two-dimensional plane on which the three ensemble teachers exist. Figure 3 shows the projection's trajectories in the cases of $\eta = 0.3$ and $\eta = 1.5$. In this figure, the symbols × represent the ensemble teachers $\boldsymbol{B}_1$, $\boldsymbol{B}_2$, and $\boldsymbol{B}_3$, the symbol ○ represents the projection of the true teacher $\boldsymbol{A}$, and the solid lines represent the trajectories of the student's projection obtained theoretically. In Fig. 3(a), the symbols △ and ▽ represent the student's projections obtained by computer simulations with $T = 2.0$ and $T = 5.0$, respectively. In Fig. 3(b), they represent the projections with $T = 0.5$ and $T = 1.0$, respectively. This figure shows that the student moves straight toward the teacher that it is using at that time. Therefore, the student's trajectories in the steady state are equilateral triangles. The triangles are small when the switching period $T$ is small and large when $T$ is large. In this figure, a side of the trajectory corresponds to a period in Figs. 1 and 2. Note that the distance between the student and the true teacher in Fig. 3 is not necessarily related to the real distance between the student and the true teacher, nor to the generalization error, since this figure shows projections. Although the student is near the true teacher when $T$ is small in Fig. 3(b), the generalization error is small when $T$ is large, as shown in Fig. 2.
Fig. 4. Means of steady state generalization errors $\epsilon_{Jg}$. Theory. $q = 0.49$, $R_B = 0.7$, and $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$. (a) $T = 0.5$, (b) $T = 5.0$.



The relationships between the learning rate $\eta$ and the means of the steady state generalization errors $\epsilon_{Jg}$ are shown in Fig. 4. The means are measured by averaging the generalization errors during a cycle after the dynamical behaviors reach the steady state. In this figure, when the learning rate satisfies $\eta < 1$, the larger the number $K$ is, the smaller the generalization error is. This is the same property as that of the fast switching model. A comparison of Figs. 4(a) and 4(b) shows that, in the slow switching model treated in this letter, the smaller the switching period $T$ is, the larger the difference among the means of the generalization errors $\epsilon_{Jg}$ for various values of $K$ is.
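The steady state means can be estimated from the theory by iterating Eqs. (13)-(15) over many switching periods and then averaging Eq. (9) over one further period. The sketch below assumes noise-free, uniform teachers as in Fig. 4 and $0 < \eta < 2$; because all periods are equivalent in the steady state under Eq. (16), it averages over a single switching period, which is an interpretation of "a cycle" made here. The function name and the numbers of warm-up periods and sample points are illustrative.

```python
import numpy as np

def steady_state_mean_error(K, T, eta, R_B=0.7, q=0.49,
                            n_warmup=400, n_pts=200):
    """Period-averaged eps_Jg in the steady state (all noise variances zero)."""
    Q = np.full((K, K), q) + (1.0 - q) * np.eye(K)
    r_BJ, r_J, l2 = np.zeros(K), 0.0, 1.0       # assumed initial conditions
    end = np.exp(-eta * T)
    end2 = np.exp(eta * (eta - 2.0) * T)
    for period in range(n_warmup):              # relax to the steady state
        k = period % K
        a = 2.0 * (r_BJ[k] - 1.0)
        l2 = 1.0 + a * end + (l2 - 1.0 - a) * end2
        r_J = R_B + (r_J - R_B) * end
        r_BJ = Q[k] + (r_BJ - Q[k]) * end
    # Average Eq. (9) over the next switching period (midpoint sampling).
    k = n_warmup % K
    tau = (np.arange(n_pts) + 0.5) * T / n_pts
    a = 2.0 * (r_BJ[k] - 1.0)
    r_J_t = R_B + (r_J - R_B) * np.exp(-eta * tau)
    l2_t = (1.0 + a * np.exp(-eta * tau)
            + (l2 - 1.0 - a) * np.exp(eta * (eta - 2.0) * tau))
    return float(np.mean(0.5 * (-2.0 * r_J_t + l2_t + 1.0)))

for K in (2, 5):
    print(K, steady_state_mean_error(K, T=0.5, eta=0.5),
          steady_state_mean_error(K, T=5.0, eta=0.5))
```

Scanning `eta` over a grid for several values of `K` with this function would give curves of the kind described for Fig. 4, under the assumptions stated above.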

The relationships between the learning rate $\eta$ and the means of the steady state generalization errors $\epsilon_{Jg}$ for various direction cosines $q$ are shown in Fig. 5. As shown in this figure, when the learning rate satisfies $\eta < 1$, the smaller $q$ is, the smaller the generalization error is. This is also the same property as that of the fast switching model. By comparing Figs. 5(a) and 5(b), we see that the smaller the switching period $T$ is, the larger the difference among the means of the generalization errors $\epsilon_{Jg}$ for various values of $q$ is.

Fig. 5. Means of steady state generalization errors $\epsilon_{Jg}$. Theory. $K = 5$, $R_B = 0.7$, and $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$. (a) $T = 0.5$, (b) $T = 5.0$.

In summary, we have analyzed the generalization performance of a student in a model composed of linear perceptrons: a true teacher, ensemble teachers, and the student. In particular, the case where the student slowly switches ensemble teachers has been analyzed. By calculating the generalization error analytically using statistical mechanics in the framework of on-line learning, we have shown that the dynamical behaviors of the generalization error have a periodicity that is synchronized with the switching period and that the behaviors differ with the number of ensemble teachers. Furthermore, we have shown that the smaller the switching period is, the larger the difference is.